2025-05-21-12-08
Language and Thought: The View from LLMs
Abstract
arXiv:2505.13561v1, announce type: new. Daniel Dennett speculated in Kinds of Minds (1996): "Perhaps the kind of mind you get when you add language to it is so different from the kind of mind you can have without language that calling them both minds is a mistake." Recent work in AI can be seen as testing Dennett's thesis by exploring the performance of AI systems with and without linguistic training. I argue that the success of Large Language Models at inferential reasoning, limited though it may be, supports Dennett's radical view about the effect of language on thought. I suggest it is the abstractness and efficiency of linguistic encoding that lie behind the capacity of LLMs to perform inferences across a wide range of domains. In a slogan, language makes inference computationally tractable. I assess what these results in AI indicate about the role of language in the workings of our own biological minds.
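Summary
In Kinds of Minds (1996), Daniel Dennett conjectured: "Perhaps the kind of mind you get when you add language to it is so different from the kind of mind you can have without language that calling them both minds is a mistake." Recent AI research can be read as a test of Dennett's thesis, comparing the performance of AI systems trained with and without language. The paper argues that the success of large language models at inferential reasoning, limited as it is, supports Dennett's radical claim about the effect of language on thought. The author suggests that the abstractness and efficiency of linguistic encoding are what allow LLMs to draw inferences across domains. In short, language makes inference computationally tractable. Finally, the paper assesses what these AI results imply for the role of language in biological minds.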
BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs
Abstract
arXiv:2505.13529v1, announce type: new. Recent advances in Large Reasoning Models (LRMs) have shown impressive capabilities in mathematical and logical reasoning. However, current LRMs rarely admit ignorance or respond with "I don't know". Instead, they often produce incorrect answers while showing undue confidence, raising concerns about their factual reliability. In this work, we identify two pathological reasoning patterns characterized by overthinking that contribute to the overconfident and incorrect answers: last-minute guessing and second-thought spiraling. To address these issues, we propose BARREL, a novel framework that promotes concise and boundary-aware factual reasoning. Our experiments show that BARREL training increases the reliability of DeepSeek-R1-Distill-Llama-8B from 39.33% to 61.48%, while still achieving accuracy comparable to models fine-tuned on reasoning data generated by R1. These results suggest that this pilot study can inform the building of more reliable and factual System 2 LRMs.
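Summary
Recent advances in Large Reasoning Models (LRMs) show impressive mathematical and logical reasoning. Yet current LRMs rarely admit ignorance or answer "I don't know"; instead, they often give wrong answers with undue confidence, which raises concerns about their factual reliability. This study identifies two pathological reasoning patterns caused by overthinking, last-minute guessing and second-thought spiraling, both of which lead to overconfident wrong answers. To address them, the authors propose BARREL, a new framework that promotes concise, boundary-aware factual reasoning. Experiments show that BARREL training raises the reliability of DeepSeek-R1-Distill-Llama-8B from 39.33% to 61.48% while keeping accuracy comparable to models fine-tuned on reasoning data generated by R1. The authors present this pilot study as a step toward more reliable, more factual System 2 LRMs.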
Evaluating Large Language Models for Real-World Engineering Tasks
Abstract
arXiv:2505.13484v1, announce type: new. Large Language Models (LLMs) are transformative not only for daily activities but also for engineering tasks. However, current evaluations of LLMs in engineering exhibit two critical shortcomings: (i) the reliance on simplified use cases, often adapted from examination materials where correctness is easily verifiable, and (ii) the use of ad hoc scenarios that insufficiently capture critical engineering competencies. Consequently, the assessment of LLMs on complex, real-world engineering problems remains largely unexplored. This paper addresses this gap by introducing a curated database comprising over 100 questions derived from authentic, production-oriented engineering scenarios, systematically designed to cover core competencies such as product design, prognosis, and diagnosis. Using this dataset, we evaluate four state-of-the-art LLMs, including both cloud-based and locally hosted instances, to systematically investigate their performance on complex engineering tasks. Our results show that LLMs demonstrate strengths in basic temporal and structural reasoning but struggle significantly with abstract reasoning, formal modeling, and context-sensitive engineering logic.
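Summary
Large Language Models (LLMs) are transformative for everyday activities and also show great potential in engineering tasks. Current evaluations of their engineering ability, however, have two key flaws: (1) they rely on simplified use cases, usually adapted from exam material whose correctness is easy to verify; and (2) they use ad hoc scenarios that fail to capture critical engineering competencies. As a result, how LLMs perform on complex, real-world engineering problems remains largely unknown. The paper fills this gap with a curated database of more than 100 questions drawn from authentic, production-oriented engineering scenarios and designed to cover core competencies such as product design, prognosis, and diagnosis. Using this dataset, the authors evaluate four state-of-the-art LLMs, both cloud-based and locally hosted, on complex engineering tasks. The results show that LLMs are strong at basic temporal and structural reasoning but fall well short on abstract reasoning, formal modeling, and context-sensitive engineering logic.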
Contrastive Cross-Course Knowledge Tracing via Concept Graph Guided Knowledge Transfer
Abstract
arXiv:2505.13489v1, announce type: new. Knowledge tracing (KT) aims to predict learners' future performance based on historical learning interactions. However, existing KT models predominantly focus on data from a single course, limiting their ability to capture a comprehensive understanding of learners' knowledge states. In this paper, we propose TransKT, a contrastive cross-course knowledge tracing method that leverages concept graph guided knowledge transfer to model the relationships between learning behaviors across different courses, thereby enhancing knowledge state estimation. Specifically, TransKT constructs a cross-course concept graph by leveraging zero-shot Large Language Model (LLM) prompts to establish implicit links between related concepts across different courses. This graph serves as the foundation for knowledge transfer, enabling the model to integrate and enhance the semantic features of learners' interactions across courses. Furthermore, TransKT includes an LLM-to-LM pipeline for incorporating summarized semantic features, which significantly improves the performance of Graph Convolutional Networks (GCNs) used for knowledge transfer. Additionally, TransKT employs a contrastive objective that aligns single-course and cross-course knowledge states, thereby refining the model's ability to provide a more robust and accurate representation of learners' overall knowledge states.
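Summary
Knowledge tracing (KT) aims to predict learners' future performance from their historical learning interactions. Existing KT models, however, focus mainly on data from a single course, which limits how fully they can capture a learner's knowledge state. The paper proposes TransKT, a contrastive cross-course knowledge tracing method that uses concept graph guided knowledge transfer to model how learning behaviors relate across courses and so improve knowledge state estimation. Specifically, TransKT builds a cross-course concept graph with zero-shot Large Language Model (LLM) prompts that establish implicit links between related concepts in different courses. The graph is the basis for knowledge transfer, letting the model integrate and enhance the semantic features of a learner's interactions across courses. TransKT also uses an LLM-to-LM pipeline to incorporate summarized semantic features, which significantly improves the Graph Convolutional Networks (GCNs) used for knowledge transfer. Finally, a contrastive objective aligns single-course and cross-course knowledge states, making the model's representation of a learner's overall knowledge state more robust and accurate.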
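The contrastive objective mentioned in the TransKT entry can be illustrated with a generic InfoNCE-style loss. This is only a minimal sketch under stated assumptions: the abstract does not specify the loss, so the InfoNCE form, the cosine similarity, the `temperature` value, and the names `cosine` and `info_nce` are all hypothetical choices made for illustration, and the toy vectors are not data from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors (assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(single, cross, temperature=0.1):
    """Mean InfoNCE loss over learners.

    single[i] and cross[i] stand in for the single-course and cross-course
    knowledge states of learner i (hypothetical toy representations).
    Each single[i] is pulled toward cross[i] and pushed away from cross[j].
    """
    losses = []
    for i, s in enumerate(single):
        logits = [cosine(s, c) / temperature for c in cross]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the matching cross-course state.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy states for two learners (illustrative values only).
single = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(single, [[0.9, 0.1], [0.1, 0.9]])
misaligned = info_nce(single, [[0.1, 0.9], [0.9, 0.1]])
```

In this toy setup the loss is lower when each learner's two views point in the same direction than when they are swapped, which is the alignment behavior the abstract attributes to the contrastive objective.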